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Abstract — Peptide sequencing from mass spectrometry data is 
a liey step in proteome researcli. Especially de novo sequencing, 
tlie identification of a peptide from its spectrum alone, is still a 
challenge even for state-of-the-art algorithmic approaches. In this 
paper we present antilope, a new fast and flexible approach 
based on mathematical programming. It builds on the spectrum 
graph model and works with a variety of scoring schemes. 
ANTILOPE combines Lagrangian relaxation for solving an integer 
linear programming formulation with an adaptation of Yen's 
k shortest paths algorithm. It shows a significant improvement 
in running time compared to mixed integer optimization and 
performs at the same speed like other state-of-the-art tools. We 
also implemented a generic probabilistic scoring scheme that 
can be trained automatically for a dataset of annotated spectra 
and is independent of the mass spectrometer type. Evaluations 
on benchmark data show that antilope is competitive to the 
popular state-of-the-art programs PepNovo and NovoHMM both 
in terms of run time and accuracy. Furthermore, it offers 
increased flexibility in the number of considered ion types. 
ANTILOPE will be freely available as part of the open source 
proteomics library OpenMS. 



I. Introduction 

Mass spectrometry-based high throughput identification of 
peptides and proteins is a key step in most proteomics research 
experiments. It requires fast algorithmic solutions with good 
identification capabilities. Depending on the initial situation 
of the experiment, two general strategies exist: database- 
assisted and de novo identification. If a database for the 
studied proteins exists the first method is usually preferred 
over de novo sequencing. The crucial step in database search 
algorithms like INSPECT |1|, SEQUEST |2|, Mascot E] 
and OMSSA |4| is to filter the database based on different 
methods. INSPECT generates peptide sequence tags (PST) 
and keeps only those candidate peptides containing the tag 
as a subsequence. SEQUEST uses the parent mass as filter 
criterion. After filtering, the query spectrum is scored against 
the remaining candidates and a ranking of possible identifica- 
tions is produced. In addition to the quality of the spectrum, 
database search methods clearly depend on the correctness and 
completeness of the database and hence on the availability of a 
suitable set of peptides or transcripts for the studied organism. 
Even if this is the case, factors like alternative splice variants 
and mutations can lead to missing identifications. 
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In such situations de novo sequencing algorithms provide an 
alternative as they infer the sequence from the spectrum itself 
without any information collected in databases. In recent years, 
many algorithms and software packages were published, with 
the most popular being PEAKS |5|, PepNovo |61, NovoHMM 
L7J , Lutefisk L8J , Sherenga 1 9 1 , EigenMS [ 1 1 , and PILOT 1 1 ij . 
Most of them use the graph-theoretical approach introduced 
by B artels fT2) and construct a so-called N-C spectrum graph 
which is then used to search for the correct sequence. See 
Fig.IH 

Using this formulation, the de novo peptide sequencing 
problem can be formulated as the search for the longest 
antisymmetric path, an NP-complete problem ifTSl . PepNovo 
and Lutefisk solve a special case of this problem by restricting 
the construction of the spectrum graph, which enables them to 
apply a dynamic programming algorithm proposed by Chen et 
al. lfT4l . lITSl . The restrictions limit the possible interpretations 
of each peak to at most one N-terminal (usually b-ion) and 
one C-terminal (usually y-ion) ion type. Liu and Cai lfT6l use 
tree-decomposition to solve the restricted problem. Bafna and 
Edwards [17] propose a variant of the dynamic programming 
approach that also allows for more interpretations leading to 
a polynomial algorithm of a higher degree. Their algorithm is 
still limited to so-called simple ion types, excluding doubly 
and triply charged ions that can also aid the identification 
process. PILOT | IT] overcomes all these restrictions using an 
integer linear programming (ILP) formulation for the longest 
antisymmetric path problem that is flexible and extensible on 
the cost of efficiency. This allows for more interpretations 
of each peak which can lead to improved identification in 
situations where the prominent b- and y-ions are missing. 
Furthermore the ILP formulation can be easily extended in 
several ways by simply adding or modifying constraints to 
further restrict or modify the set of possible solutions. The 
approach also allows for global reasoning such as limiting the 
number of a certain amino acid type for each prediction. 

The main contribution of this work is an improvement 
of this approach by an extension that retains most of the 
flexibility and leads to a remarkable improvement in running 
time. Instead of focusing on computing one antisymmetric path 
we propose a novel algorithm to find the k best antisymmetric 
paths. We achieve this by applying the Lagrangian relaxation 
technique to the problem and solving the subproblems with an 
elegant variant of Yen's k shortest paths algorithm. Lagrangian 
relaxation was akeady successfully applied to biological prob- 
lems such as sequence alignment |18|, protein |19| and RNA 
11201 structural alignment or protein threading \2\\ . 

An additional contribution of this paper is a generic prob- 
abilistic scoring scheme that can be trained automatically 
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Fig. 1. Spectram graph generation, (a) Simplified tandem mass spectrum of tiie peptide VEALR. Rounded m/z values in Da are presented on top of each 
peak, (b) The corresponding spectrum graph with two nodes being generated for each peak. One under the assumption of being a b-ion, the other under 
the assumption of being a y-ion. It it obvious that the path starting at node s with mass and ending at node t with mass 568 encodes the correct peptide 
sequence. The undirected edges connecting complementary nodes are drawn as dashed lines. 



for a dataset of annotated spectra and is independent of the 
mass spectrometer type. The performance of both, de novo 
and database search approaches, depends on a good scoring 
function to model prediction quaUty. Currently used scor- 
ing functions range from rather simple peak intensity-based 
scoring to statistical models including Bayesian networks. 
The latter show a better performance but require re-training 
for different spectrometer types and thus depend on reliable 
annotated datasets. Our flexible scoring scheme allows for 
user controlled training on supplied annotated datasets. The 
topology of the network can either be defined by the user 
or, following the approach proposed by Bern [22], learned 
from the given dataset directly. We extend this approach by 
considering ion intensities and cleavage positions similar to 
the PepNovo scoring in order to account for shifts of the 
fragmentation patterns between different m/z regions along 
the spectrum. 

Our software ANTILOPE (ANTlsymmetric path search with 
Lagrangian Optimization for PEptide identification), an imple- 
mentation of the improved approach, is freely available as part 
of upcoming releases of the open source proteomics library 
OpenMS ll23]l . 

The structure of the remainder of this paper is as follows. 
Section describes our new method. In Section |III] we 
compare our tool with the state-of-the-art tools PepNovo, 
NovoHMM, LutefiskXP and PILOT. Finally, in Section [IVl 
we discuss our results and future work. 



II. Novel De Novo Peptide Sequencing Algorithm 

This section describes our new approach to the de novo 
sequencing problem. At first we formally introduce the graph - 
theoretic formulation and the resulting ILP formulation our 
method ANTILOPE is based on. Then we present our new 
algorithmic approach to find the k best solutions of the ILP. 
Finally, we explain the scoring model of ANTILOPE. 



A. Graph-Theoretical Formulation 

Bartels introduced the transformation of a tandem mass 
spectrum into the so-called spectrum graph, a now commonly 
used data structure in graph-theoretical approaches to the de 
novo sequencing problem [6l, ||9l, lfT6l . see also Fig.[T] Using 
this data structure, the original problem amounts to finding a 
longest path with certain properties in this graph. 

When a peptide P is fragmented by collision induced dis- 
sociation (CID) it usually breaks along the backbone between 
two neighboring amino acids into a pair of N-terminal (prefix) 
and C-terminal (suffix) fragments. We define the residual 
mass of P as the sum of the monoisotopic masses of all 
amino acid residues in P. By parent mass Mp we denote 
the total mass of P, which is the residual mass, plus 18 Da 
for an additional water molecule. Depending on the exact 
fragmentation position, different types of fragment ions are 
produced that have a certain mass offset compared to the prefix 
residue mass (PRM) or suffix residue mass. Besides the types 
presented in Fig.|2] also neutral loss variants, e.g., loss of water 
or ammonia, of several ion types are observed frequently as 
well as multiply charged ions. The fragmentation process is 
still not fully understood and which types are generated with 
which intensity depends on many factors. 

The spectrum graph G, consists of a set of nodes V , a set 




Fig. 2. Peptide fragmentation along the backbone. This figure displays the 
most prominent fragmentation positions for the generation of pairs of b/y-ion, 
a/x-ion and c/z-ion in the backbone of a peptide. 
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of directed edges iJ^) and a set of undirected edges Ejj. In 
the original definition the spectrum graph does not contain the 
set of undirected edges, which is a slight modification by Liu 
and Cai llT6l who termed this the extended spectrum graph. In 
the spectrum graph each node corresponds to some possible 
prefix residue mass of the peptide to be identified. Directed 
edges represent amino acids and connect nodes if their mass 
difference can be explained by some amino acid. Two nodes 
that lead to contradicting interpretations of some mass peak 
are called complementary and are connected by an undirected 
edge. 

Given the tandem mass spectrum of some unknown peptide 
the construction of the spectrum graph is as follows: Each 
peak s with mass in the input spectrum generates a set of 
nodes. If we consider k different N-terminal ion types (e.g., b- 
ion and a-ion) with mass offsets 5i, . . . ,5^ (+1 Da for b-ions, 
—27 Da for a-ions) from the PRM, then peak s generates k 
nodes with masses TOs — . . . , — Sk- For C-terminal ion- 
types with offsets Si, ... ,6k, additional k nodes with masses 
Mp — 18 — (ttIs — ^i), . . . , Mp — 18 — (iris — 6k) are generated. 
Each of these nodes represents the prefix residue mass under 
the assumption that s was generated by an ion of a certain 
type. Clearly at most one of these nodes can represent the true 
PRM, therefore they are all contradicting each other and are 
connected by undirected edges. Whenever the mass difference 
of two nodes Vi and Vk equals the mass of some amino acid a 
(±e), we connect Vi and Vk via a directed edge {vi, Vk) labeled 
with a. Finally we add two so called goalpost nodes s and t, 
with masses and Pm — 18 Da, respectively. 

If the spectrum of some peptide P is complete, i.e., fragment 
ion peaks are abundant for each possible cleavage site of P, 
then there exists a node for each PRM of P. Therefore the 
correct sequence of P is obtained by finding the s-t-path of 
nodes corresponding to the true prefix sequences of P and by 
concatenation of the edge labels along this path. Each node in 
the spectrum graph has a score that represents the reliability 
of that node to correspond to a true PRM. 

However, simply looking for the longest path in the graph 
often leads to infeasible solutions, namely if two nodes that 
were generated by the same peak are included in the path, 
since in general only one of them corresponds to a true PRM. 
This problem is aggravated when the score of each node is 
directly related to the intensity of the generating peak. In 
such a scenario a high intensity peak generates several high 
scoring nodes and a longest path search then tends to include 
a pair of complementary nodes in the longest path leading to 
a contradicting N- and C-terminal interpretation of the same 
peak. Such an infeasible path is called symmetric because the 
pairs of forbidden pairs of N-terminal and C-terminal nodes 
form a symmetric structure, which can be seen in Fig. [T] To 
solve the de novo sequencing problem we hence have to search 
for antisymmetric paths. These are paths without contradicting 
nodes. They therefore do not contain pairs of nodes that are 
connected by an undirected edge. See Fig. [3] for an example. 

Most de novo sequencing algorithms generate one pair of 
complementary nodes for each peak assuming it being either 
a b-ion or y-ion. These pairs form a nested non-interleaving 
structure allowing for efficient computation. But although b- 



Fig. 3. Symmetric path example. This figure sketches schematically the 
situation when an infeasible symmetric path would be preferred over a feasible 
antisymmetric solution. Assuming that the small nodes have a score of 1 and 
the bold nodes have a score of 2, the illegal s-t path scores higher than the 
legal one. Therefore, in this example, a simple longest path search yields 
infeasible solutions. 



and y-ions are usually the most abundant in CID spectra, there 
are cases in which both of them are missing and therefore no 
correct node is generated in this case. Therefore it is promising 
to include nodes for other interpretations, especially in the low 
and high mass range of the spectrum where fragmentation is 
usually less complete. 

While the longest antisymmetric path problem is NP- 
complete for general directed graphs |T3l, there exist polyno- 
mial algorithms for the special case where the non-interleaving 
property is satisfied. The polynomial algorithm proposed by 
Chen 1 14 1 uses dynamic programming to compute an op- 
timal solution to the longest antisymmetric path with non- 
interleaving forbidden pairs. In a second paper Lu and Chen 
fTS] extended this approach to compute suboptimal solutions 
by constructing a so called matrix spectrum graph and ap- 
plying depth-first search and a backtracking algorithm. In 
contrast, the ILP formulation presented in the next section 
does not depend on such a nested structure and corresponds 
to the de novo sequencing problem for any desired set of ion 
types. 

B. Integer Linear Programming Formulation 

Our algorithm is based on the following integer linear 
programming (ILP) formulation fl^, which is very similar 
to the one Floudas and DiMaggio used for their tool PILOT 
1 11]. Our formulation models the problem by means of zero- 
one variables for each edge. We put the score of each node on 
all its outgoing directed edges. As the graph is acyclic, this is 
a safe transformation. 

max ^ Ci^kXi^k (1) 
^s,k = l (2) 
^M = l (3) 

Y ^^fc - Y ^-^J " ^ Vfc e F \ {vs, vt] (4) 

{vi,Vk)eED {vk,Vj)eEo 

Y Y ^^'^ 1 VeeEu (5) 

ViGe {vi,Vk)eED 

X^Me{0,l} (6) 

We introduce a binary variable ^ for every directed edge 
{vi, Vk) e E]j which has value one if edge {vi, Vk) is part of 
the path (active) and zero otherwise (inactive). The objective 
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function ([TJ maximizes the summed score of all active directed 
edges. For the two goalposts s and t, the two constraints (|2]i 
and (|3]l assure that exactly one active edge leaves s and one 
enters t. Together with the flow conservation constraints (|4), 
they establish a correspondence between feasible solutions of 
the ILP and s-t paths in the graph. An optimal solution of the 
ILP consisting of objective function ([TJ and constraints (|2]l, 
(|3]l and (|4]i corresponds to a longest s-t path, still possibly 
symmetric and therefore infeasible for the de novo sequencing 
problem. Therefore we add another constraint (|5) that makes 
sure that for each pair of contradicting nodes at most one 
will be selected. The difference of our model to the one 
proposed by Floudas and DiMaggio is twofold. First, we do 
not introduce variables for nodes as they are not required. This 
does not change the general structure of the formulation and 
has no strong effect on the time required for solving. Second, 
we do not formulate a constraint that prevents the exact mass 
of the predicted sequence to deviate from the measured parent 
mass by more than a certain threshold value (usually 2.5 Da). 
We argue that in our algorithm it is more promising to defer 
this filtering to a later stage of the algorithm. Since we add 
edges that correspond to pairs and triples of amino acids 
which often represent several possible combinations of amino 
acids, there is no exact mass which could be used in such a 
constraint. Therefore we perform the filtering at a later stage 
when we have created the candidate superset. 

C. Applying Lagrangian Relaxation 

While linear programming (LP) problems can be solved 
in polynomial worst case time, adding integrality constraints 
makes them generally NP-hard and the resulting integer lin- 
ear programs (ILPs) require different algorithmic solution 
approaches. One common method is to first solve the LP 
relaxation and then investigate the obtained solution. If the 
solution is fractional one has to resort to techniques like 
branch-and-bound or branch-and-cut using upper and lower 
bounds obtained from heuristics and from the relaxed solution. 

We apply a different kind of relaxation method, Lagrangian 
relaxation, which yields in many cases much more efficient 
algorithms than those based on LP relaxations because it 
can exploit structural knowledge of the problem. Lagrangian 
relaxation is motivated by the experience that many hard 
integer programming problems correspond to a significantly 
easier problem that has been complicated by an additional 
set of constraints. To obtain the efficiently computable La- 
grangian problem, the complicating constraints are removed 
and replaced by a penalty term in the objective function. The 
relaxed problem obtained that way is called the Lagrangian 
problem and can often be solved efficiently. 

The Lagrangian relaxation of the de novo sequencing 
ILP ([T]l-® is straightforward as it is very obvious that the 
antisymmetry constraints form the class of hard constraints 
that complicate the computationally easy problem of a longest 
path search in a directed acyclic graph (DAG). We can solve 
this relaxed problem by means of a simple standard algorithm, 
which can be found in reference material ll25l . To make 
the Lagrangian relaxation more transparent we rewrite the 



objective function in a way that the edge variables are grouped 
by the undirected edges incident to their left end: 



max 



E E 



Next we apply Lagrangian relaxation by dropping the anti- 
symmetry constraint (|5) and moving it to the objective function 
to penalize its violation. This leads to the Lagrangian problem 



Z[\) = max ^ ^ {ci^k - K 



IXik 



Ae (7) 



eeEtj 



Xs.k = 1 
E ^ ^ 

{vk,vt)&Eo 

XiM - Y Xkj =0 \fk eV \ {vs,vt} 

ivi,Vk)eEo {vk,Vj)eEo 

Xi,k e {0,1} 

The vector A holds the Lagrangian multipliers, non-negative 
real numbers that define the weight of the penalty term. 

Lemma 1. The Lagrangian problem (0 can be solved in 
linear time and space. 

Proof: Solving the Lagrangian problem consist of the 
following steps: First we simply subtract from each edge 
weight Ci^k the value Ag, for all undirected edges e incident 
to node Vi. Then we apply the linear time 0(|V^| + 
longest path search algorithm for DAGs on the graph with the 
modified edge weights. Finally we add the value of X^eeBu 
to the score obtained from the longest path search algorithm. 
Obviously each of the steps requires only linear time and 
space. ■ 
By restricting the Lagrangian multipliers to non-negative 
values one can easily show that the value of the solution of 
the Lagrangian problem is an upper bound to the optimal value 
of the original problem ||261 . In order to obtain a tight bound, 
the strategy is to find the values for the Lagrangian multipliers 
that minimize Z{X), which means solving the dual problem: 

Zr, = min ZiX) . 

A>0 

We apply the efficient iterative subgradient optimization al- 
gorithm, computing sequences of multipliers A* where t ~ 
0, 1, 2, . . . denotes the iteration. We start with Ag = 0, for all 
e e Eu and in each iteration t we compute the subgradients 

Sl^l- J2v,eeJ2(v„v,)eEn ^ and update 

the Lagrangian multipliers according to formula: 



A*+i =max{0,A* -6'*S'*} 



(8) 



One crucial factor with a huge influence on the performance 
is the step-size 6. The subgradient method converges to the 
optimal solution Z^ if the step-size satisfies the following 
conditions fT7\ : 

k 

lim 0'' ^0 and lim V 0' = cx) . 
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A formula that is widely used for step-size computation 
because it shows good performance in practice is given by 

where Z* is the value of the best solution to the original 
problem that was computed yet and 7* defines a decreasing 
adaption parameter 

D. Suboptimal Solutions 

A straightforward strategy to compute suboptimal solutions, 
also implemented in PILOT lillj . is to cut off previous 
solutions by an additional constraint. A known drawback of 
this approach is that solving time may increase dramatically 
after generating a few suboptimal solutions. We suggest a 
different strategy and overcome this problem by means of an 
algorithm which, for a given number k, directly computes the 
k longest paths. We use the algorithm by Yen |28| that was 
originally designed to compute the k shortest paths without 
cycles on general directed graphs. 

Yen's algorithm is a deviation algorithm based on the fact 
that the i-th shortest path Pi, will coincide with every shorter 
path Pi-i ... Pi up to some node until it deviates. The farthest 
node from the source s with this property is called deviation 
node d{Pi). 

The strategy to find the i + 1-st shortest s-t path Pi+i 
is, starting at d{Pi), to compute for each node of Pi the 
shortest path to t, that deviates from Pi at node i>* . Therefore a 
shortest path from Vj to t is computed which is not allowed to 
use the edge (w* , Wj+i). This shortest path from Vj to t is then 
concatenated with the prefix {v\ . . -Vj-i) of Pi to obtain the 
shortest s-t path that deviates from Pi at node . This path is 
added to a candidate set X. After the shortest deviating paths 
of Pi have been computed, the shortest path in the candidate 
set X corresponds to the i + 1-st shortest s-t path Pi+i and 
is removed from X. 

Yen's algorithm performs an additional trick to guarantee for 
paths without cycles that we do not discuss here. For a more 
detailed description of this algorithm and variants please refer 
to reference material ||28l , ||29l . 

Our problem differs in a few points from the original 
problem solved by Yen's algorithm, so it requires a few 
adaptations. While Yen's algorithm is designed for general 
directed graphs that may contain cycles, we are working on 
a DAG. This simplifies the problem as we do not have to 
worry about cycles and can simply transform the shortest path 
problem into a longest path problem. Note that the longest 
path problem is NP-complete in graphs with cycles. A second 
difference is that we have the additional condition to find 
antisymmetric paths. Therefore every time the shortest path 
algorithm is called in the Yen's algorithm, we replace this by 
solving the Lagrangian relaxation formulation for the longest 
antisymmetric path search. The following theorem and its 
proof capture the main algorithmic result of this paper 

Theorem 1. The combination of our Lagrangian relaxation- 
based algorithm for antisymmetric paths and a modification of 
Yen's algorithm solves the problem of computing the k longest 



antisymmetric paths in time 0{kls{\E\ + |V^|)), where I is 
the length of the longest path and s is the total number of 
subgradient iterations. 

Proof: In iteration i + 1 of Yen's algorithm the computed 
path deviating from Pi at node must satisfy two conditions 
in order to form an antisymmetric path in G. 

1) There are no two nodes in the path from to t that are 
in conflict. 

2) None of the nodes in the computed path from Vj to t is 
in conflict with some node from the prefix of path Pi 
up to node Vj. 

The first condition is satisfied by the Lagrangian relaxation 
formulation itself, because if applied to the subgraph Vj... t, 
every feasible solution corresponds to an antisymmetric path 
from node to t. To meet the second condition it is sufficient 
to remove all nodes from the subgraph Vj . . .t that are con- 
nected via an undirected edge with some node of the prefix 
s . . . of Pi before we compute the longest antisymmetric 
path. This trick also simplifies the longest antisymmetric path 
search for increasing j as the possibilities to generate an 
infeasible solution are decreasing. 

The complexity of Yen's algorithm for computing the k- 
longest paths in a DAG is 0{k\V\{\E\ + \V\)). The first factor 
V^l comes from the fact that, in a general graph, one path 
can possibly contain all \V\ nodes. In the case of peptide 
sequencing, the length of a path equals the length of the 
predicted peptide which usually does not exceed a length of 30 
for typical experimental settings. In the longest antisymmetric 
path version using Lagrangian relaxation, the 0{\E\ + \V\) 
DAG longest path algorithm gets iteratively called during 
subgradient optimization algorithm. Therefore the complexity 
of our formulation for identification of a peptide containing I 
amino acids is 0{kls{\E\ + \V\)) with s being the number of 
iterations during subgradient optimization. ■ 

Note that the value of s is possibly exponential if the 
subgradient optimization does not converge and the complete 
branch and bound tree has to be enumerated. Nevertheless, 
in the results section we will show that for our peptide 
sequencing formulation on average only very few iterations 
are required which leads to a practically efficient algorithm. 

E. Scoring Model 

We use a probabilistic scoring based on a Bayesian network 
similar to the scoring model of PepNovo. Bayesian networks 
are directed acyclic graphs where nodes represent random 
variables and the edges represent conditional dependencies 
between variables. The variables in our model are the ion 
types t ^ T that are considered by our scoring model and the 
possible values for each variable is the intensity. Therefore, as 
a first step, we normalize the intensity of all peaks to discrete 
values as defined by Dancik et al. ||9| by using their rank as 
intensity. 

The usage of Bayesian networks for scoring nodes in 
the spectrum graph is motivated by the observation that 
fragmentation events are not independent. For example, the 
probability of observing a strong b-ion is not independent of 
the abundance and intensity of the complementary y-ion. 
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Unlike for the PepNovo algorithm, where the structure of 
the probabilistic network is predefined leading to a fixed set 
of accounted conditional dependencies, we implemented a 
flexible scoring scheme where the network topology can be 
either defined by the user or it can be learned during the 
training process automatically. 

For inference and training of the Bayesian network we used 
the Bayesian Network Classifiers in the machine learning suite 
Weka [30 1 . Similar to PepNovo, we discretize the relative 
position of a cleavage into several (default 3) equally sized 
regions r to account for the different intensity distributions 
in the center and terminal regions usually observed in CID 
spectra. For each of the regions we train a Bayesian network 
using some training set of tandem mass spectra with known 
peptide identification. For each training spectrum we construct 
the node set of the spectrum graph and select an equal number 
of true positives (vertices representing a true PRM) and false 
positives (vertices representing not representing PRM). For 
each of the selected nodes we look for witnessing peaks at 
their calculated positions and record their normalized intensity 
to obtain the training vectors for the Bayesian network. Each 
of the training vectors has one additional entry, the class label, 
which is T for true positives and F for false positives. We 
select only those ion types of the witness set for the network 
training that appear in at least t percent of the true positive 
samples of the training set where the threshold parameter t 
can be defined by the user For each of these selected types 
we then add a node in the Bayesian network. One additional 
node for the class is added. During the network training, the 
structure (set of directed edges) of the network is learned and 
once the structure is fixed the conditional probability tables are 
learned from the training data. While the user can control a 
huge range of possible options for the Weka Bayesian network 
classifier training through our program, we set as the default 
training algorithm the K2-HillClimber and the Bayesian metric 
for local scoring [31]. For a user defined network topology the 
first step is skipped and only the conditional probability tables 
are computed. 

Once the network is trained, we score a node v in the 
spectrum graph by looking for peaks in the spectrum at the 
calculated masses for the selected ion types to obtain the set of 
intensity observations /^'. Using the trained Bayesian network 
BN we then compute the log likelihood ratio as: 

Pr(/" I BN, class = T) 



LLR(v) 



log- 



(10) 



Pr(/^' I BA^, class =F) 

where Pr(/" | BA^, class = X e {T,F}) is the probability 
of observation F under model BN when the class variable is 
set to X. In contrast to Bern and Datta who obtain their false 
positive samples from perturbation of the correct PRM we only 
take false positive PRM for which a node was created during 
the spectrum graph construction. We chose that approach since 
we want the Bayesian network to discriminate between correct 
and false nodes in the spectrum graph. By just perturbing the 
true PRM one will very likely generate false positive training 
samples containing only zero intensity entries which will never 
be the case for a node in the spectrum graph as it requires at 
least one peak to be generated. 



Additional to the Bayesian network we also use a simple 
intensity rank score Sr{v) as it is also used by INSPECT |f|. 
This score is the ratio between two probabilities, the proba- 
bility that a peak with a certain intensity rank corresponds to 
a certain ion type (e.g., a b-ion) and the the probability that a 
randomly chosen peak was generated by that ion type. As these 
values differ between different mass regions of a spectrum, we 
split the spectrum into three equally spaced mass regions and 
estimate the probabilities for each of them separately using the 
same training data as for the Bayesian network. For example, 
if we generate a node for a peak of rank 4, and this node 
interprets the peak as a b-ion, then Sii{v) is the log ratio 
between the probability that a rank 4 peak is a b-ion and the 
probability that any random peak is a b-ion. 

The final score s{v) for each node v of the spectrum graph 
is then computed as: 



s{v) ^LLR(v) + Sr{v) 



(11) 



Nodes having negative scores correspond to unreliable 
PRMs and are removed from the graph in order to reduce 
the size of the spectrum graph and speed up the candidate 
generation process. Since our formulation is working with 
edge weights, we move the node scores onto the edges, 
such that each directed edge gets the score of its left node. 
In the filtered spectrum graph we compute the predefined 
number of suboptimal solutions, each corresponding to one 
antisymmetric path. To account for missed cleavages we also 
add edges corresponding to pairs and triples of amino acids to 
the spectrum graph. For each of the generated candidates, in 
a second step, we try to resolve the pairs and triples of amino 
acids. Therefore we generate all possible combinations and 
permutations of amino acids to generate a candidate superset. 
The candidates in that superset are then re-scored by a refined 
version of a shared peaks count where we reward abundant 
witness peaks and penalize missing ones. 

Given a candidate sequence we look for witnessing peaks 
in the query spectrum and give a bonus if one was found or a 
penalty if it is missing. Further we check whether the peak is 
a primary isotopic peak, a secondary isotopic peak or a lone 
peak. A peak is called a primary isotopic peak if we find a 
child peak at offset 1 Da for a singly charged ion or 0.5 Da for 
a doubly charged ion. Equivalently a peak is called a secondary 
isotopic peak if it has a parent peak with offset -1 Da for a 
singly charged ion or -0.5 Da for a doubly charged ion. If a 
peak is neither primary isotopic nor secondary isotopic then 
it is a lone peak. If a witnessing peak is a primary peak we 
add another bonus to the score while we charge a penalty if it 
is a secondary peak. While the reward and penalty score are 
actually user parameters we will offer a generic algorithm to 
estimate reasonable values in future versions. The candidates 
are then re-ranked according to this score and the predefined 
number of candidates is returned. 

III. Results 

In this section we present and discuss our computational 
results. We compared ANTILOPE with state-of-the-art alter- 
native peptide identification software with respect to running 
time and quality. 
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A. Efficiency 

The major contribution of this work is the new algorithmic 
approach based on Lagrangian relaxation. We will first give 
a thorough analysis of the performance and compare it to 
the ILP formulation (l)-(6). Like implemented in PILOT, we 
generate the suboptimal solutions by introducing additional 
constraints that cut off previous solutions. We implemented 
our algorithm in C++ and use the OpenMS [23 1 library that 
offers convenient data structures and algorithms to handle and 
manipulate spectral data. For the ILP formulation we use the 
commercial CPLEX [32] solver software (version 9.0), which 
is in general the fastest solver available. We were not able 
to directly compare to PILOT because the software is not 
available upon request. In Fig.|4]we compare the running times 
of the ILP and our Lagrangian relaxation formulation on a set 
of 100 tandem mass spectra from the ISB dataset f33\. In this 
comparison we only consider the time required to generate 
the set of candidate sequences and ignore the preprocessing 
of the spectrum and the spectrum graph generation as these 
steps are independent of the applied algorithm. We compared 
the running time required to generate the top scoring 20, 30 
and 50 candidates for each spectrum. The figure shows that 
our approach significantly outperforms the ILP formulation 
on all instances and the performance gain increases with the 
number of candidates to be generated. Our algorithm is on 
average « 9 times faster for the best 20 candidates, for 30 and 
50 candidates the average advantage increases to a factor of 
?a 12 and w 18. While the run time for of the ILP formulation 
for the top 50 candidates was usually above 2 seconds, the 
Lagrangian relaxation formulation requires on average only a 
few tenths of a second. 

In a closer analysis we investigated the convergence behav- 
ior of our Lagrangian relaxation formulation. It reveals that 
for each Lagrange problem solved during the path ranking 
algorithm only very few iterations of the subgradient opti- 
mization are required. The path ranking algorithm maintains a 
list of previously detected candidate paths together with their 
scores. Since the score Z{X) of the Lagrange problem is an 
upper bound to the score Zip of the best possible feasible 
solution, subgradient optimization can be aborted as soon as 
Z(A) falls below the lowest score in the candidate list. If the 
Lagrangian relaxation does not converge and cannot be aborted 
after 100 iterations we apply a branching step. We use the best 
infeasible path found during the subgradient optimization and 
arbitrarily choose one node vt involved in a conflict. Then we 
generate two subproblems, one forcing vi, to be in the path 
and one forbidding Vf, to be selected. We found that only for a 
very small fraction of the Lagrange problems a branching step 
had to be performed, and the depth of branch-and-bound trees 
never exceeded a value of three. It is necessary to mention 
that the performance strongly depends on the scoring function 
used, since a good scoring function will not generate many 
high scoring nodes for the same peak and only the correct 
one should receive a significantly high score. Therefore a 
good scoring function does not only affect the identification 
performance but also affects the complexity of the candidate 
generation. 
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Fig. 4. Running time comparison between Lagrangian relaxation and 
ILP fonnulation for computation of 20, 30, and 50 suboptimal solutions 
of 100 benchmark spectra. Box-and-whisker plots display median, quar- 
tiles, and extrema of the distribution of relative performance gains rpg = 
run-time(ILP)/mn-time(Lagrange) . ANTILOPE outperforms the CPLEX-based 
method for all spectra and all numbers of suboptimal solutions. The advantage 
increases with the number of suboptimal solutions. The considered spectrum 
graphs contained between 80 and 200 nodes. 



B. Sequencing Performance 

We compared the performance of ANTILOPE to four 
non-commercial de novo sequencing tools, LutefiskXP, 
NovoHMM, PILOtQ PepNovo and the commercial software 
PEAK^. We used two measures, accuracy and recall, to 
assess their performance. Accuracy denotes the fraction of cor- 
rectly predicted amino acid residues compared to all predicted 
residues. Recall is the fraction of correctly predicted residues 
compared to the total number of residues of the correct 
peptide sequences. When looking at suboptimal solutions for 
each algorithm we looked for the prediction with the highest 
recall and reported the values of this prediction for recall 
and precision. In case of multiple predictions with the same 
recall value we report the values for the one with the highest 
precision among them. As benchmark set, we chose tandem 
mass spectra from the ISB dataset [33] that were generated 
by an ESI-ion trap mass spectrometer by Thermo Finnigan 
and spectra from the open proteomics database. This set of 
reliably annotated spectra from tryptic peptides has already 
been used for training of PepNovo and NovoHMM. We created 
a training set of 1214 spectra from doubly charged precursor 
ions of unique peptides to train the scoring model. During the 
Bayesian network training for each of the 3 mass sectors, only 
the ion types that had a peak in at least 20% of the true positive 
training samples are selected for the corresponding Bayesian 
network. The topologies of the Bayesian networks together 



As PILOT was not available, the identifications for the test data were 
generated by the authors of PILOT. 

-We used the PEAKS Online 2.0 web interface. 
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Fig. 5. Benchmark. Comparison of accuracy and recall of ANTILOPE with NovoHMM, PepNovo, PILOT, PEAKS and LutefiskXP. We compare the accuracy 
and recall of the best prediction among the top 1, 3, 5 and 10 ranked candidates returned by each tool. As the best prediction we consider the one with the 
best recall among the candidates. Since NovoHMM generates only one candidate per spectrum it appears only in the first plot. Discussion in text. 



with a brief discussion can be found in the supplementary ma- 
terial. The parameters to score the peptide spectrum matches 
for the candidate sequences in the superset were chosen as 
follows: For an abundant b- or y-ion we awarded the score 
PSMb — PSMy = 1, doubly charged b- or y-ions scored 
0.5, a-ions 0.3 and all neutral losses were awarded a score of 
0.2. Isotopic peaks for some type t were awarded a score of 
PSMt ■ 0.2. If some peak was missing the penalty of PSMt ■ 0.5 
was subtracted from the score. When a peak was classified 
as a secondary peak its score is reduced to PSMt 0.8. The 
score for some peak is then weighted with the relative m/z 
distance between the expected and the observed m/z value 
using a linear function. 

The test set consists of 200 spectra of peptides (peptides 
in training and test dataset disjunct) with a molecular mass 
of at most 1600 Da and an average peptide length of 10 
residues. We score a predicted amino acid as correct, if its 
predicted starting mass position does not deviate by more than 
2.5 Da from the correct starting mass position. Further, in our 
evaluation we do not discriminate between the amino acids 
Q/K and and I/L since their masses cannot be distinguished. 

To compare the tools we do not only look at the top hit, 
but we also look at the accuracy and recall for the best hit 
in the top 3, 5 and 10 candidates. The results are presented 
in Fig. |5] Since NovoHMM only generates one candidate per 
spectrum it appears only in the first plot. Looking only at the 
top hit, the recall of ANTILOPE (« 73.4%) is only marginally 
lower than of PEAKS (w 73.7%) but slightly better than that 
of PILOT (« 71.5%), NovoHMM (« 70.2%) and PepNovo 
(« 69.4%). The recall of LutefiskXP (« 60.6%) is much lower 
than for all other tools. Since ANTILOPE and NovoHMM both 
compute complete sequences, they have almost equal values 
for accuracy and recall, while for LutefiskXP and PepNovo 
these values differ as they allow for gaps in their predicted 
sequences. If we go over from the top hit to the best 3, 5 and 



10 candidates, we observe that in terms of recall, ANTILOPE is 
always very close to PepNovo and PEAKS (equal for the top 
3, 2.5% advantage of PepNovo and PEAKS for the top 10) 
and always approximately 4% better than PILOT. In terms of 
accuracy PepNovo (w 92%) has a better performance than 
ANTILOPE, PEAKS and PILOT since it allows for partial 
peptide predictions. The accuracy of LutefiskXP is slightly 
better than for ANTILOPE and PILOT but this accuracy is 
achieved at a much lower recall which is between 12% 
and 14% lower in all four cases. The four tools ANTILOPE, 
LutefiskXP, NovoHMM and PepNovo are comparable in terms 
of run time which is usually between 0.5 and 1.5 seconds 
per spectrum. The running time of PILOT as reported by 
the authors was on average around 9 seconds per spectrum. 
We cannot directly estimate the runtime of PEAKS since the 
identification is performed via a web interface. 

IV. Conclusion 

We proposed a new algorithmic approach to solve the 
longest antisymmetric path problem by means of Lagrangian 
relaxation, combined with a polynomial algorithm for subopti- 
mal solutions. Using this approach the algorithm is flexible and 
not restricted to the nested structure of the spectrum graph and 
solves this problem much faster than an LP relaxation-based 
method for the same formulation. Therefore, for our tool AN- 
TILOPE, the candidate generation is no longer the bottleneck 
as the most time consuming step is the re-ranking phase since 
the number of possible candidates can easily explode if several 
double and triple amino acid edges are selected. In terms of 
sequencing performance, ANTILOPE is akeady competitive to 
available state-of-the-art programs PepNovo and PEAKS while 
it outperforms LutefiskXP and NovoHMM especially if we 
also consider suboptimal solutions. For long peptides PepNovo 
still has a small advantage, which is mostly due to the fact 
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that the current version of ANTILOPE produces only complete 
annotations without gaps. 

Actually we only generated two nodes for each peak, one 
for a b- and one for a y-ion. Generating nodes for all ion 
types decreased the performance as this always lead to some 
high scoring, but false nodes and thus to wrong interpretations. 
Nevertheless we are sure that generating more nodes can 
lead to better identifications in combination with a refined 
scoring scheme. The algorithmic framework is flexible enough 
to work with mass spectra generated by different kind of 
mass spectrometers. So the user can define for which ion 
types a node shall be generated. This can lead to improved 
identification performance for different datasets. Combined 
with a scoring function trained on a representative set of 
spectra, the ability of our algorithm to directly model multiply 
charged ions can lead to an improvement over the other 
algorithms when analyzing tandem mass spectra obtained from 
higher charged precursor ions. 

For the future we plan to improve our algorithm in several 
directions. We will include support for identification of pep- 
tides containing post-translational modifications. Further we 
want to support combinations of complementary fragmentation 
techniques like CID together with electron transfer dissocia- 
tion (ETD) or CID with electron capture dissociation (ECD), 
which can improve the identification ||22| . If34l . In these 
applications the flexibility of our formulation may become a 
major advantage over the other programs. 

To improve the performance for spectra of longer peptides 
we will extend ANTILOPE in a way that it can produce 
partial predictions allowing for gaps at the terminals. This, 
together with a machine learning strategy for the re-scoring 
like the rank-boosting algorithm used by PepNovo, should 
lead to a further improvement. ANTILOPE is freely available 
as part upcoming releases of the open source proteomics 
library OpenMS Il23l allowing for convenient integration into 
experimental workflows. 
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